Bias and Information of Bayesian Adaptive Testing


Since test scores are typically used to differentiate among persons, one
highly desirable property of a test would be that it measure equally well at all
points. Another consideration is that it measure each person precisely. Thus,
an "ideal" test would have a high, horizontal information function. Unfortunately,
this ideal cannot normally be achieved in a fixed-length conventional
test that draws its items from a much larger fixed pool of test items. Ordinarily,
some trade-offs must be made. Relatively high information at a point can
be achieved by "peaking" the test, that is, constructing it of the most
discriminating items in a narrow range of difficulty. A relatively flat but low
information function can be achieved by selecting equidiscriminating items having a
wide range of item difficulty values. The only way to approximate a high, flat
information function is to administer to each person the subset of items that
provides the most information at his/her level of ability, θ. The problem with
this is obvious: θ is unknown before the test is administered.

An adaptive test can select items during the course of testing in such a
way as to attempt to maximize the information obtained for each examinee. This
may be done either by simple branching (administering a more difficult item
after a correct answer and an easier item after an incorrect answer) or by more
elaborate techniques. Owen's (1969, 1975) Bayesian adaptive testing strategy
estimates θ after each item response, then selects the unused test item that is,
in one sense, the most "informative" at the current estimated ability level.
The result is that different persons take different sets of test items; each set
of test items spans a range of difficulty levels approximately tailored to
provide maximal information about the individual examinee.

The information function of the test scores derived from any adaptive
testing procedure should be (1) flatter than that of a peaked test of the same
length constructed from the same item pool and (2) higher than that of a
rectangular test of the same length drawn from the same item pool. The height
of the adaptive test's information function will be determined in large part by
the discriminations and guessing parameters of the constituent items of the item
pool as well as by test length. The flatness of the information curve (and to
some extent its height) will depend largely on the range of item difficulties in
the pool and on the effectiveness of the adaptive item selection procedure.

Urry (1971) conducted monte carlo simulations of Owen's (1969, 1975)
sequential procedure using three different simulated item banks: two banks of
"ideal" item parameters and one bank of items with the same parameters as the
VSAT (Lord, 1968). Urry's Item Bank A had 20 equidiscriminating items (a = 1.6)
at each of five equally spaced levels on the ability continuum; his Item Bank B
employed five items of the same discrimination (a = 1.6) at each of 20 ability
levels; and Item Bank C employed the parameters actually occurring in the VSAT.
Banks A and B required an average of just over 11 items to test termination.
Bank C required an average of 27.5 items to termination. The other noteworthy
result of Urry's (1971) simulation studies was the magnitude of the fidelity
coefficients. For simulated examinees drawn randomly from a normal (0,1)
population, the observed correlations of .936 (Item Bank A) and .919 (Item Bank B)
are quite high in view of the relatively short test lengths involved.


Jensema (1972) simulated Owen's (1969, 1975) approach to Bayesian testing
using the actual item responses of 100 live examinees to 58 mathematics items
drawn from four conventional pre-college tests taken at full length by the
examinees. From a record of their item-by-item actual test performance, a computer
program constructed artificial protocols of their responses to the items that
would have been administered by Bayesian sequential tests under two different
conditions: with and without differential prior information about examinees'
abilities. Parallel to these two "real data" simulations, Jensema carried out
monte carlo simulations of the Bayesian procedure. These simulations used 100
simulated examinees and items with logistic ogive parameters identical to the 58
real items. Item scores were generated as a stochastic function of ability, θ,
and the parameters of each item. The adaptive tests were terminated in each
instance when the posterior variance of the Bayesian ability estimate fell below
.0625 or when 30 items had been administered, whichever occurred first.

In the real-data simulation, mean test length was about 27 items, with or
without differential initial ability estimates. The Bayesian estimates
correlated about .86 with scores on a weighted composite of the four conventional
tests from which the item bank was selected. Jensema did not report a
correlation of ability with test length or with precision of estimate, but he did
observe that the posterior variance criterion terminated the testing only in the
upper portions of the distribution of estimated ability. Jensema interpreted
these results to imply that the item pool was unsatisfactory for adaptive
testing in the lower ability levels due to the low discriminations of the items in
that region of the difficulty continuum. His monte carlo results using the same
item pool resulted in virtually identical mean test lengths and in correlations
of .92 between estimated ability and true ability. He concluded, in part, that
a satisfactory item pool for adaptive testing needs to employ very highly
discriminating items uniformly distributed on the difficulty continuum. Another
conclusion he reached (this one on the basis of monte carlo simulation with
ideal item banks) was that for most purposes little was to be gained by the use of
prior information about examinees to determine a variable initial θ estimate.
Jensema found that using differential prior information resulted in an average
savings of only one test item.

In another monte carlo study of Owen's Bayesian strategy, Jensema (1974)
examined the effects of item parameters and Bayesian test length on test
reliability. He showed that reliability is directly related to the posterior
variance of the Bayesian ability estimate; hence, using a specific value of that
posterior variance as a termination criterion determines the reliability of the
test. Jensema showed that the average number of items required to attain that
reliability varies as a function of the item parameters. With items uniformly
distributed on difficulty, the higher the item discrimination, the shorter the
test.

McBride (1977; McBride & Weiss, 1976) also studied characteristics of the
ability estimates resulting from Owen's (1969, 1975) strategy. These monte
carlo simulations involved (1) an ideal item pool with variable test length; (2)
the effects of guessing and item discrimination in a perfect item pool; (3) the
effects of fixed test length; and (4) the effects of ability level and item pool
configuration. In the first three studies, the performance of the adaptive test
was evaluated on overall indices including the overall bias and mean absolute
error of the ability estimates, the correlation of ability estimates with true
ability (fidelity), and correlations of true and estimated ability
levels with errors and test length.

The fourth study evaluated the performance of this testing strategy in an
item pool with no correlation between difficulty and discrimination parameters,
and using items with high negative and high positive correlations between these
parameters. In contrast to the other studies, characteristics of the ability
estimates were examined as a function of true θ; dependent variables included
bias and information conditional on θ. Contrasting with the first three
studies, which showed little overall mean bias and information, Study 4 showed
severe bias in the conditional θ estimates for all three item pool configurations.
Estimates of θ were unbiased only for five θ values between θ = 1.0 and -1.0;
low θ values were overestimated and high θ values were underestimated. In
addition, the information curves for the three item pool configurations were not
high and flat as would be expected, at least when the ideal item pool was used
in which difficulty and discrimination parameters were uncorrelated.

Gorman (1980) also examined the bias and information of scores produced by
Owen's Bayesian testing procedure. These analyses were based on two "ideal"
item pools with discriminations of a = .8 and 1.6, in which 101 items were
rectangularly distributed in difficulty, and both true and estimated item
parameters were used. Gorman also studied the effect of applying a correction for
regression (proposed by Urry, 1977) to ability estimates from Owen's testing
procedure, designed to reduce bias in the estimates. His results show
substantial bias in the uncorrected θ estimates, with positive bias for θ levels below
zero, negative bias for θ levels above zero, and higher levels of bias for the
less discriminating items. His data also show that Urry's correction was not
entirely successful in eliminating the bias, since the corrected θ estimates for
θ levels above zero resulted in positive bias. Since Gorman's study used an
ideal, but finite, item pool, however, his results may be partially item pool
dependent. In addition, Gorman's study did not attempt to determine the cause
of the bias in the θ estimates but simply examined one possible approach to
reducing it.


Purpose 


The present study was designed to further investigate the nature of the
bias and the information characteristics of Owen's Bayesian adaptive testing
strategy and to examine possible causes of the bias. Factors investigated
included (1) the effects of item discrimination, (2) the effects of fixed vs.
variable test length, and (3) the effect of an accurate prior θ estimate.


Method 


Monte carlo simulation of Owen's adaptive test was used. Unlike some
previous simulation studies, but similar to Studies 1 to 3 in McBride (1977), the
present studies did not use a prestructured item pool. Rather, the tests were
simulated using a perfect and infinite item pool having any difficulty
parameters required by the item selection process, with restrictions only on the item
discriminations and pseudo-guessing parameters, c. By thus simulating an
infinite item pool, the results of the simulation studies should reveal, within the
limits of sampling error, the inherent properties of the Bayesian adaptive test,
unaffected by the idiosyncrasies of a typical finite item pool.

Similarly, following the procedures of Study 4 in McBride (1977) in order
to permit accurate description of the properties of the testing method as they
vary with trait level, the simulated examinees (simulees) were not drawn
randomly from a specified distribution; rather, a large number of examinees were
simulated at each of a number of trait levels throughout the normally encountered
range.

Examinees 


For the purposes of monte carlo simulation, an examinee i was characterized
by a numerical value, θ_i, which is the actual trait level. In each of the eight
data sets generated, there were 3,100 simulees, with 100 at each of 31 θ levels
equally spaced in the interval -3.0 to 3.0. This range of the trait would
include about 99.7% of a population normally distributed on θ, with mean 0 and
variance 1.
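The simulee design can be written out in a few lines. This is a minimal sketch; the variable names are hypothetical, and the coverage figure is simply the standard normal probability of the interval -3 to 3 computed from the error function.

```python
import math

# 31 equally spaced trait levels from -3.0 to 3.0 (step 0.2)
theta_levels = [round(-3.0 + 0.2 * i, 1) for i in range(31)]

SIMULEES_PER_LEVEL = 100
n_simulees = len(theta_levels) * SIMULEES_PER_LEVEL  # simulees per data set

# Proportion of a normal (0,1) population inside [-3, 3]:
# P(|Z| <= 3) = erf(3 / sqrt(2))
coverage = math.erf(3.0 / math.sqrt(2.0))

print(n_simulees, theta_levels[0], theta_levels[-1], round(coverage, 4))
```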

Test  Items 


For each separate item administration, an item was computer generated with
the pseudo-guessing (c) parameter held constant at .20, simulating a
five-alternative multiple-choice item. The item discrimination, a, was constant for each
data set, with a = .80, 1.60, or 2.40 between data sets.

Following McBride (1977), the difficulty (b) parameter for each simulated
item administration was determined by the current θ estimate (the prior mean
M_{m-1} of the estimated distribution of θ_i before administering the mth item)
and by the constant item parameters a_g and c_g, according to the formula

    b_g = M_{m-1} - \frac{1}{1.7 a_g} \log\left[ \frac{1 + \sqrt{1 + 8 c_g}}{2} \right]    [1]

Equation 1 gives the item difficulty value having maximal information when
θ_i = M_{m-1}, and a_g and c_g are fixed (Birnbaum, 1968, p. 464). Since, in
general, θ_i is unknown and the best available estimate is M_{m-1}, the item
difficulty chosen is the one that is the most informative, given the current
estimate of θ at any point in the adaptive test.
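The difficulty selection rule can be checked numerically. The sketch below assumes Birnbaum's result that a three-parameter logistic item is most informative at a trait level lying log[(1 + sqrt(1 + 8c))/2] / (1.7a) above its difficulty, so the difficulty is placed that far below the current prior mean; the function names are hypothetical.

```python
import math

D = 1.7  # logistic scaling constant


def optimal_difficulty(prior_mean, a, c):
    """Difficulty that maximizes 3PL item information at theta = prior_mean."""
    shift = math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0) / (D * a)
    return prior_mean - shift


def item_information(theta, a, b, c):
    """Birnbaum's item information for the 3PL logistic model."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


# Current prior mean 0, a = 1.6, c = .20
b_star = optimal_difficulty(0.0, 1.6, 0.2)
print(round(b_star, 4))

# Information at theta = 0 for the chosen difficulty and for nearby values
for b in (b_star - 0.05, b_star, b_star + 0.05):
    print(round(b, 4), round(item_information(0.0, 1.6, b, 0.2), 5))
```

With no guessing (c = 0) the shift vanishes and the selected difficulty equals the prior mean; with c = .20 the selected item is slightly easier than the current estimate.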


Item  Responses 

The dichotomous (0,1) score of any simulee on any item is a probabilistic
function of the simulee's status θ_i on the trait θ, the item difficulty b_g, and
the parameters a_g and c_g. The probability P_g(θ_i) of a correct response
(u_g = 1) under the logistic model item characteristic curve is

    P_g(\theta_i) = c_g + \frac{1 - c_g}{1 + \exp[-1.7 a_g (\theta_i - b_g)]}    [2]
In order to simulate item responses, each time an item administration took
place the quantity P_g(θ_i) was compared with a pseudo-random number r generated
from a distribution uniform in the interval [0,1]. A score of u_g = 1 was
assigned whenever P_g(θ_i) equaled or exceeded r; otherwise, a score of 0 was
assigned.
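The response generation step can be sketched as follows. This is an illustrative implementation using Python's standard pseudo-random generator, not the generator used in the original simulations; the function names and the seed are hypothetical.

```python
import math
import random

D = 1.7  # logistic scaling constant


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_response(theta, a, b, c, rng):
    """Score 1 if the response probability equals or exceeds a uniform draw."""
    r = rng.random()  # pseudo-random number, uniform on [0, 1)
    return 1 if p_correct(theta, a, b, c) >= r else 0


rng = random.Random(12345)  # arbitrary seed, fixed only for reproducibility

# A simulee at theta = 0 answering items with b = 0, a = 1.6, c = .20:
# the response probability is .20 + .80 / 2 = .60.
scores = [simulate_response(0.0, 1.6, 0.0, 0.2, rng) for _ in range(10000)]
proportion_correct = sum(scores) / len(scores)
print(round(proportion_correct, 3))
```

Over many administrations the proportion of correct responses should approach the model probability, which provides a simple check on the generator.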

Dependent  Variables 

For the simulated test of each individual i, the following were recorded:

k, the number of items administered;

M_k, the posterior mean after k items (i.e., θ̂); and

V_k, the posterior variance after k items (i.e., the variance of θ̂).

These values were averaged at each level of θ across the 100 simulees at that
level, resulting in \bar{\hat{\theta}}_i, the mean of the θ estimates at each
level θ_i (i = 1, 2, ..., 31), and \sigma^2(\hat{\theta}_i), the variance of θ̂ at
each θ level. Bias was determined at each of the θ levels by

    Bias = (\bar{\hat{\theta}}_i - \theta_i)    [3]

Information was computed from the formula

    I(\theta_i) = \frac{(\hat{\theta}'_i)^2}{\sigma^2(\hat{\theta}_i)}    [4]

where \hat{\theta}'_i is the first derivative, evaluated at θ_i, of the polynomial
regression of θ̂ on θ.
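The computation of conditional bias and information can be sketched as follows. Several points are assumptions rather than details given in the text: information is taken as the squared slope of the regression of the estimates on θ divided by the conditional variance of the estimates, the polynomial is arbitrarily set to degree 3, and the regression is fitted to the level means by ordinary least squares. The helper names and the synthetic data are hypothetical.

```python
import random
import statistics


def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    n = degree + 1
    # Augmented normal-equation matrix [X'X | X'y]
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    for col in range(n):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def poly_derivative(coefs, x):
    """First derivative of the fitted polynomial at x."""
    return sum(k * c * x ** (k - 1) for k, c in enumerate(coefs) if k > 0)


def bias_and_information(estimates_by_theta, degree=3):
    """Bias and information at each theta level from simulated estimates."""
    thetas = sorted(estimates_by_theta)
    means = [statistics.fmean(estimates_by_theta[t]) for t in thetas]
    variances = [statistics.variance(estimates_by_theta[t]) for t in thetas]
    coefs = polyfit(thetas, means, degree)
    bias = [m - t for m, t in zip(means, thetas)]
    info = [poly_derivative(coefs, t) ** 2 / v
            for t, v in zip(thetas, variances)]
    return thetas, bias, info


# Synthetic estimates (not study data): regressed toward 0 with slope .8
rng = random.Random(7)
levels = [round(-3.0 + 0.2 * i, 1) for i in range(31)]
fake = {t: [0.8 * t + rng.gauss(0.0, 0.3) for _ in range(100)] for t in levels}
thetas, bias, info = bias_and_information(fake)
print(round(bias[0], 2), round(bias[-1], 2), round(statistics.fmean(info), 2))
```

In the synthetic example, estimates regressed toward zero produce positive bias at low θ and negative bias at high θ, the same qualitative pattern discussed in the literature reviewed above.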

Independent Variables


Eight data sets were analyzed for three levels of item discrimination. The
characteristics of the three studies and the data sets are summarized in Table
1.

Study I: Accurate prior θ estimate. This study was intended to provide
"best case" data in order to serve as a benchmark against which other studies
could be evaluated. The "best case" for the Bayesian adaptive test ought to be
one involving a "perfect" item pool and accurate prior knowledge about
examinees' trait levels. Accurate prior knowledge means that each examinee's trait
level was known beforehand and was used as the mean of the Bayes prior
distribution. Under these conditions the only limitations on the information and
accuracy of estimate of Owen's procedure are those imposed by the test length, and
by the discriminations and guessing parameters of the simulated test items.
Holding those variables constant, any idiosyncrasies in the behavior of the test
scores must be due to the trait level estimation and item difficulty selection
procedure.

Two separate and independent test administrations were simulated for each
of the 3,100 simulees. In Data Set 1, all item discriminations were .80, and in
Data Set 2, a = 1.60. For each simulee, the Bayes initial prior distribution
was normal, with mean θ_i and variance 1.0. Thus, at the outset of testing, the
initial estimate of each simulee's trait level was accurate. The adaptive test
was allowed to run its normal course, re-estimating θ_i after every item response
and selecting the next item accordingly, until 20 items had been administered.

Table 1
Summary of the Independent Variables in the Three Studies

                        Prior Distribution     Termination Criterion
Study and                                      Posterior     No. of
Data Set        a       Mean      Variance     Variance      Items

Study I
  1            .80      θ_i       1            -             20
  2           1.60      θ_i       1            -             20
Study II
  3            .80      0         1            -             20
  4           1.60      0         1            -             20
  5           2.40      0         1            -             20
Study III
  6            .80      0         1            .10           30
  7           1.60      0         1            .10           30
  8           2.40      0         1            .10           30

Study II: Constant prior θ estimate with fixed test length. Study II
replicated the 20-item fixed test length and constant a values of .80 and 1.60 from
Study I. To examine effects with more highly discriminating items, Data Set 5
used a = 2.40 for all items, while Data Sets 3 and 4 used items with a = .80 and
1.60, as in Study I. In contrast to Study I, the three data sets of Study II used
the same initial normal prior distribution (mean = 0, variance = 1.0) for all
simulees, regardless of actual trait level. In this study, then, a more typical
use of the Bayesian adaptive testing strategy was simulated, i.e., the
application to individuals for whom no θ estimates were available prior to
testing; consequently, a group prior θ distribution was used to select the first
item to be administered. As in Study I, a fixed-length test of 20 items was
administered to each simulee.

Study III: Constant prior θ estimate with variable test length. In Study
III, as in Study II, the same initial normal (0,1) prior distribution was
assumed for all simulees. The difference between the studies was in the test
termination criterion. In Study III, testing was terminated for each simulee
whenever the posterior variance fell below .10. This value corresponds to the
"standard error of estimate" criterion of .3162 (the square root of .10)
specified by Urry (1974) to achieve a fidelity coefficient exceeding .95 in a
normal (0,1) population of examinees. A maximum test length of 30 items was
imposed, so that if the posterior variance criterion had not been reached within
30 items, testing was terminated. As for Study II, three levels of item
discrimination (a = .80, 1.60, and 2.40) were studied in Data Sets 6, 7, and 8,
respectively.
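The three study designs differ only in the prior mean and the termination rule, which a single simulation loop can express. The sketch below is not Owen's procedure as published: Owen's closed-form updating formulas are replaced here by a numerical stand-in that computes the posterior mean and variance on a grid and carries them forward as the next normal prior. The function names, grid, seed, and the small number of replications are all hypothetical choices made to keep the example short.

```python
import math
import random

D = 1.7
GRID = [-6.0 + 0.05 * i for i in range(241)]  # quadrature points for theta


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def optimal_difficulty(mean, a, c):
    """Most informative difficulty at the current prior mean."""
    return mean - math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0) / (D * a)


def update(mean, var, a, b, c, u):
    """Posterior mean and variance under a normal prior, by grid quadrature.

    Assumption: a numerical substitute for Owen's closed-form update.
    """
    weights = []
    for t in GRID:
        prior = math.exp(-0.5 * (t - mean) ** 2 / var)
        p = p_correct(t, a, b, c)
        weights.append(prior * (p if u == 1 else 1.0 - p))
    total = sum(weights)
    post_mean = sum(w * t for w, t in zip(weights, GRID)) / total
    post_var = sum(w * (t - post_mean) ** 2
                   for w, t in zip(weights, GRID)) / total
    return post_mean, post_var


def adaptive_test(theta, a, c, prior_mean, prior_var, max_items, var_stop, rng):
    """Run one simulated test; return (k, posterior mean, posterior variance)."""
    mean, var, k = prior_mean, prior_var, 0
    # Stop at max_items, or earlier once the variance falls below var_stop.
    while k < max_items and (var_stop is None or var >= var_stop):
        b = optimal_difficulty(mean, a, c)
        u = 1 if p_correct(theta, a, b, c) >= rng.random() else 0
        mean, var = update(mean, var, a, b, c, u)
        k += 1
    return k, mean, var


rng = random.Random(2024)
# Study II style: common normal (0,1) prior, fixed length of 20 items
fixed = [adaptive_test(-2.0, 1.6, 0.2, 0.0, 1.0, 20, None, rng)
         for _ in range(50)]
# Study III style: stop when posterior variance < .10, at most 30 items
variable = [adaptive_test(1.0, 1.6, 0.2, 0.0, 1.0, 30, 0.10, rng)
            for _ in range(50)]
print(sum(m for _, m, _ in fixed) / 50, sum(k for k, _, _ in variable) / 50)
```

A Study I style run is obtained by passing the simulee's true θ as `prior_mean`. Because the update is a numerical substitute, results from this sketch should not be expected to reproduce the values reported for Owen's procedure.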


Results 


Accurate Prior θ Estimate

Bias of the ability estimates for the two data sets of Study I is shown in
Figure 1 (numerical values of bias and information for Data Sets 1 and 2 are in
Appendix Table A). As Figure 1 shows, there was virtually no bias in the
ability estimates for Data Set 2 (a = 1.6), with a small amount of bias alternating
between positive bias and negative bias for Data Set 1 (a = .8). The maximum
amount of bias observed in the data was at θ = -1.3, where mean bias was -.10; a
similar degree of bias was observed at θ = -1.8.

Figure 1

Bias as a Function of θ for Data Sets 1 and 2

Legend: Data Set 1 (a = .8); Data Set 2 (a = 1.6)


Figure 2 shows information curves for Data Sets 1 and 2. As the results
show, the information for Data Set 1 was relatively flat throughout the θ range.
The maximum information was observed at θ = -.5, with minimum information at θ =
+.2. Information ranged between 7 and 11, with only minor variations across the
ability range. The information for Data Set 2 was relatively flat, but not as
flat as that for Data Set 1. There was a spike at θ = .8 with a secondary peak
at θ = -2.8, and overall more variability between θ levels than for Data Set 1.
In general, there is a slight concave trend to the information values for Data
Set 2, with the exception of the spike at θ = .8. However, the general trend is
a relatively flat information function for both data sets.

Figure 2

Information as a Function of θ for Data Sets 1 and 2

Legend: Data Set 1 (a = .8); Data Set 2 (a = 1.6)

Constant Prior θ Estimate with Fixed Test Length


Figure 3 shows the bias in the θ estimates for the data sets of Study II at
each of the three levels of item discrimination (numerical values of bias and
information are in Appendix Table B). For all three data sets there is a
negative slope to the bias curve, with low θ values being overestimated and higher θ
values being underestimated. In addition, there are some substantial
differences in the bias curves for the three levels of discrimination. Data Set 3
(a = .8) showed the highest levels of bias of all three data sets. Very severe
bias was observed for negative θ levels and severe bias in the opposite
direction for positive θ levels. When item discriminations were increased in Data
Set 4, there was only a slight drop in the positive bias for low θ levels and a
more substantial drop in negative bias for the θ levels above the mean.
Increasing the item discriminations to 2.4 in Data Set 5 resulted in virtually no
change in bias for low θ levels but a further decrease in bias for the positive θ
levels, with the range of unbiased ability estimates extending from approximately
θ = -1 to θ = +1.5 in Data Set 5. As these results show, the effect of increasing
item discrimination is to reduce bias somewhat, primarily for high θ levels.
For low θ levels (θ < -2.0), substantial levels of bias (.20 or more) were
observed for the highly discriminating items of Data Set 5.

Figure 3

Bias as a Function of θ for Data Sets 3, 4, and 5

Legend: Data Set 3 (a = .8); Data Set 4 (a = 1.6); Data Set 5 (a = 2.4)


Figure 4 shows test information curves for the three data sets of Study II.
As Figure 4 shows, with the low discriminating items (a = .8) of Data Set 3,
test information is relatively flat for θ levels above about θ = -1.3, with a
decrease in information below that level. As item discrimination is increased,
the results for Data Set 4 show the information curve peaking, with relatively
lower information levels for θ > 1.6 and θ < -1.3, and a greater asymmetry in
the information curve. Finally, when the items of Data Set 5 (a = 2.4) were
used, the information curve becomes even more peaked and more variable, with
high levels of information generally in the range of θ = +1 to -1, and with
information dropping off extremely quickly beyond that range. For θ levels below
-1, there is little difference in information when item discriminations are
increased from a = 1.6 to a = 2.4. For θ levels below -1.8, levels of information
are not increased by increasing item discriminations.


Constant Prior θ Estimate with Variable Test Length

Figure 5 shows bias functions for the three data sets of Study III
(numerical values for bias and information are in Appendix Tables C, D, and E). As the
results show, the least bias for low θ levels was observed for Data Set 6 (a = .8),
while the high θ levels obtained the highest degree of bias for that data set.
As item discriminations increased, bias for low θ levels increased, while bias
for the high θ levels decreased. Extremely high levels of bias were observed
for Data Set 7 (a = 1.6) and Data Set 8 (a = 2.4) for θ levels less than θ = -2.

Figure 5

Bias as a Function of θ for Data Sets 6, 7, and 8

Figure 6 shows test information functions for the variable-length
conditions of Data Sets 6 through 8. The information function that most approximated
the horizontal and equiprecise ideal was achieved by Data Set 6 (a = .8), which
obtained relatively constant levels of information for θ values greater than θ =
-1.5. As item discrimination was increased, the level of information obtained
for low θ levels decreased, while the level of information obtained for high θ
levels remained similar. The result of increasing item discrimination was a
general increase in peakedness and asymmetry of the test information functions.


Figure  6 

Information as a Function of θ for Data Sets 6, 7, and 8


Figure 7 shows the mean number of items administered for each of the θ
levels for the data sets of Study III (numerical values are in Appendix Tables C,
D, and E). As expected, more items were needed in Data Set 6, which had lower
item discriminations, than in Data Sets 7 and 8. The results show that in Data
Set 6, 30 items were generally not sufficient, on the average, for the adaptive
test to achieve the specified level of posterior variance (.10) at most θ
levels. The results also show that test length required was an increasing
function of θ for Data Sets 7 and 8. While, on the average, the posterior
variance termination criterion of .10 was achieved with about 8.5 items for low θ
values in Data Set 7, twice the number of items (17.0) was necessary to achieve
the same posterior variance termination criterion (on the average) for θ = +3.
The same trend was observed for the more highly discriminating items of Data Set
8.


Figure  7 

Mean Number of Items Administered as a Function of θ
for Data Sets 6, 7, and 8


Discussion  and  Conclusions 

This study used a "perfect" item pool in order to evaluate the performance
of Owen's Bayesian adaptive testing strategy under ideal conditions. The
results show that in terms of achieving statistically unbiased measurement and
measurements of equal precision throughout the range of ability, Owen's adaptive
testing strategy achieves these desirable goals only under the extremely
unrealistic condition of an accurate prior ability estimate. Of course, in a
realistic testing situation, the examinee's ability is not known beforehand;
otherwise, testing would not be necessary. Thus, the data of Study I serve only as
an unrealistic baseline condition to which results of other more realistic
testing conditions can be compared. Even under the unrealistic conditions of Study
I, however, there was a tendency for increasing item discrimination to result in
increasing variability in levels of information as a function of θ.

Studies II and III evaluated Owen's Bayesian testing strategy under the
more realistic testing conditions of a constant prior θ estimate, with both
fixed and variable test length. The results of Studies II and III show that this
adaptive testing strategy does not achieve unbiased measurement or measurements
of equal precision when a constant prior θ estimate is used for all examinees,
regardless of whether test length is fixed or variable. The results show an
interaction of the termination criterion with the performance of the adaptive
testing strategy, both in terms of bias and information.

When a constant test length is used, increasing item discrimination results
in decreased bias, with a more substantial decrease in bias for high θ levels.
When variable termination is used, increasing item discrimination results in
only slightly decreased bias for high θ levels, but in increased bias for low θ
levels, with extremely high levels of bias for very low θ levels. In terms of
information, the flattest information curves were observed for both termination
criteria with the least discriminating items. As item discrimination was
increased, in both cases the information curve became more peaked and asymmetric,
with a greater degree of asymmetry observed for the variable-length testing
condition. Results also showed that different mean numbers of items were necessary
to achieve a fixed posterior variance termination criterion at different levels
of θ. With moderately and highly discriminating items (a = 1.6 and a = 2.4),
twice as many items were necessary, on the average, for high θ levels to
reach a posterior variance termination criterion of .10 as for low θ levels.

Because this study used a perfect item pool in which items of a specified
discrimination were available at any level of difficulty, the results observed
in these studies cannot be attributed to deficiencies in the item pool, as might
be the case for the results reported by Gorman (1980). Rather, these results
are attributable to the effect of the constant prior θ estimate, as is shown by
the comparison of results between Studies II and III and those of Study I.
Although the effect of Urry's (1977) correction for regression was not explicitly
examined in these studies, it is unlikely that it would have the desired effects
under both the fixed-length and variable-length test conditions, since, as
indicated, there was an interaction of observed bias with the termination criterion.
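The mechanism described above, a constant prior pulling estimates toward the prior mean even when the item pool is perfect, can be illustrated with a small simulation. The Python sketch below is not Owen's procedure: it swaps in a numerical grid posterior with a posterior-mean estimate, and it assumes a two-parameter logistic item model with no guessing, a N(0, 1) prior, and an item whose difficulty equals the current estimate at every step. All function names and default values are hypothetical, and the sketch is not expected to reproduce the tabled results.

```python
import numpy as np

GRID = np.linspace(-6.0, 6.0, 601)  # ability grid used for the posterior


def p_correct(theta, a, b):
    """Two-parameter logistic response function (an assumed model)."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))


def adaptive_test(true_theta, a, rng, prior_mean=0.0, prior_var=1.0,
                  max_items=30, target_var=None):
    """Run one simulated test; return (final estimate, items used)."""
    # Constant normal prior: the same for every simulated examinee.
    post = np.exp(-0.5 * (GRID - prior_mean) ** 2 / prior_var)
    post /= post.sum()
    est = prior_mean
    for n_items in range(1, max_items + 1):
        b = est  # "perfect" pool: an item exists at the current estimate
        correct = rng.random() < p_correct(true_theta, a, b)
        like = p_correct(GRID, a, b)
        post = post * (like if correct else 1.0 - like)
        post /= post.sum()
        est = float(np.sum(GRID * post))               # posterior mean
        var = float(np.sum((GRID - est) ** 2 * post))  # posterior variance
        # Variable-length rule: stop once the posterior variance is small.
        if target_var is not None and var <= target_var:
            break
    return est, n_items


def mean_bias(true_theta, a, n_reps=300, seed=0, **kwargs):
    """Average (estimate - true theta) over simulated examinees."""
    rng = np.random.default_rng(seed)
    ests = [adaptive_test(true_theta, a, rng, **kwargs)[0]
            for _ in range(n_reps)]
    return float(np.mean(ests)) - true_theta


if __name__ == "__main__":
    for theta in (-3.0, 0.0, 3.0):
        fixed = mean_bias(theta, a=1.6, max_items=10)
        variable = mean_bias(theta, a=1.6, target_var=0.10)
        print(f"theta={theta:+.1f}  fixed: {fixed:+.2f}  variable: {variable:+.2f}")
```

Passing `max_items` alone gives a fixed-length test, while passing `target_var` gives a variable-length test that stops at the posterior variance criterion. In this simplified setting the bias is positive at low θ and negative at high θ, that is, toward the prior mean.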

Although a major purpose of adaptive testing is to provide measurements
with equal precision/information at all levels of the ability continuum (Weiss,
1982), results of these analyses show that under the realistic conditions of a
constant prior θ estimate, Owen's Bayesian adaptive testing strategy does not
achieve this desirable goal. Since the test information curves utilize some of
the same data from which the bias curves were computed, the results for
information are in a sense a consequence of the bias in the θ estimates. The data
from these three studies show that the bias results from use of a constant prior
θ estimate. Further research will be necessary to determine whether and to what
degree the use of variable prior θ estimates will affect the performance of
Owen's adaptive testing strategy in terms of reducing the bias and, consequently,
improving the equiprecision of its ability estimates.
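One way to read the link between information and bias is to assume that information was computed as the squared slope of the regression of θ̂ on θ divided by the conditional variance of θ̂, that is, I(θ) = [dE(θ̂|θ)/dθ]² / Var(θ̂|θ). That definition is an assumption here, since it is not restated in this section. Under it, the slope implied by each tabled pair of information and variance can be recovered, as in the Python sketch below, which uses three rows of Table D (Data Set 7). The function name is illustrative.

```python
import math

# (theta, variance of the estimate, tabled information) from Table D, Data Set 7
rows = [(-3.0, 0.221, 0.001), (0.0, 0.099, 9.135), (3.0, 0.063, 6.121)]


def implied_slope(variance, information):
    """Slope of E(estimate | theta) implied by I = slope**2 / variance (assumed)."""
    return math.sqrt(information * variance)


for theta, variance, information in rows:
    print(f"theta={theta:+.1f}  implied slope={implied_slope(variance, information):.2f}")
# theta=-3.0  implied slope=0.01
# theta=+0.0  implied slope=0.95
# theta=+3.0  implied slope=0.62
```

Under this assumption, a slope near 1 means the mean estimate tracks θ closely. The near-zero slope at θ = -3.0 matches the nearly constant mean estimates in the lowest rows of Table D, which is the same flattening that appears there as large positive bias.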
Appendix: Supplementary Tables


Table A

Mean and Variance of θ̂, Bias and Information, as a Function of θ
for the Data Sets of Study I

                     Data Set 1                               Data Set 2
               θ̂                                        θ̂
     θ     Mean  Variance    Bias  Information       Mean  Variance    Bias  Information
  -3.0   -3.040      .124    -.04        7.669     -3.002      .044     .00       22.253
  -2.8   -2.778      .125     .02        7.656     -2.836      .037    -.04       26.509
  -2.6   -2.564      .148     .04        6.504     -2.604      .046     .00       21.359
  -2.4
Table D

Mean and Variance of θ̂, Bias, Information,
and Mean and Standard Deviation of Number of
Items Administered as a Function of θ
for Data Set 7

               θ̂                                      No. of Items
     θ     Mean  Variance    Bias  Information      Mean    S.D.
  -3.0   -1.742      .221    1.26         .001      8.37     .90
  -2.8   -1.675      .233    1.12         .035      8.49     .85
  -2.6   -1.752      .150     .85         .237      8.41     .76
  -2.4   -1.762      .152     .64         .523      8.52     .82
  -2.2   -1.661      .108     .54        1.263      8.65     .77
  -2.0   -1.488      .205     .51         .992      8.96     .86
  -1.8   -1.478      .139     .32        1.997      9.30     .91
  -1.6   -1.333      .139     .29        2.565      9.45     .75
  -1.4   -1.241      .110     .16        3.978      9.85     .77
  -1.2   -1.108      .107     .09        4.846     10.03     .77
  -1.0    -.955      .103     .04        5.801     10.15     .77
   -.8    -.760      .082     .04        8.202     10.62     .81
   -.6    -.596      .085     .00        8.731     10.74     .77
   -.4    -.402      .077     .00       10.451     11.16     .88
   -.2    -.213      .060    -.01       14.320     11.56     .93
   0.0    -.028      .099    -.03        9.135     11.81     .96
    .2     .195      .071     .00       13.234     11.91     .98
    .4     .354      .085    -.05       11.342     12.28     .84
    .6     .459      .081    -.05       12.068     12.60     .80
    .8     .762      .084    -.04       11.661     12.76     .83
   1.0     .930      .110    -.07        8.820     12.91     .88
   1.2    1.153      .046    -.05       20.645     12.98     .68
   1.4    1.303      .071    -.10       12.934     13.36     .83
   1.6    1.504      .076    -.10       11.534     13.65     .91
   1.8    1.638      .078    -.16       10.582     13.86    1.00
   2.0    1.827      .101    -.17        7.580     14.47     .92
   2.2    1.994      .080    -.21        8.730     14.58     .93
   2.4    2.210      .089    -.19        7.024     15.13     .82
   2.6    2.407      .109    -.19        5.022     15.51     .86
   2.8    2.490      .055    -.31        8.490     15.72     .65
   3.0    2.675      .063    -.33        6.121     16.17     .87
Table E

Mean and Variance of θ̂, Bias, Information,
and Mean and Standard Deviation of Number of
Items Administered as a Function of θ
for Data Set 8

               θ̂                                      No. of Items
     θ     Mean  Variance    Bias  Information      Mean    S.D.
  -3.0   -1.485      .216    1.51         .417      5.33     .57
  -2.8   -1.473      .230    1.33         .117      5.31     .54
  -2.6   -1.466      .183    1.13         .007      5.29     .55
  -2.4   -1.432      .284     .97         .026      5.31     .54
  -2.2   -1.528      .178     .67         .222      5.22     .50
  -2.0   -1.439      .185     .56         .503      5.55     .58
  -1.8   -1.354      .193     .45         .844      5.44     .59
  -1.6   -1.345      .113     .26        2.168      5.50     .56
  -1.4   -1.227      .113     .17        2.964      5.67
         -1.056      .108     .14
                     .139                           6.15
   -.8                        .03                   6.39     .69
   -.6    -.615      .095    -.01        7.419      6.50     .75
   -.4    -.409      .090    -.01        8.725      6.95     .86
   -.2    -.240      .087    -.04        9.841      7.28     .78
   0.0    -.048      .078    -.05       11.742      7.43     .67
    .2     .157      .084    -.04       11.463      7.61     .61
    .4     .368      .079    -.03       12.611      7.93     .65
    .6     .548      .070    -.05       14.501      8.01     .68
    .8     .794      .082    -.01       12.427      8.27     .83
   1.0     .956      .070    -.04       14.400      8.25     .73
   1.2    1.111      .071    -.09       13.834      8.48     .77
   1.4    1.299      .071    -.10       13.272      8.78     .88
   1.6    1.519      .064    -.08       13.892      9.23     .86
   1.8    1.708      .085    -.09        9.693      9.56     .72
   2.0    1.859      .100    -.14        7.482      9.83     .72
   2.2    2.099      .071    -.10        9.353     10.26     .74
   2.4    2.224      .069    -.18        8.312     10.61     .82
   2.6    2.393      .059    -.21        8.124     11.10     .89
   2.8    2.517      .060    -.28        6.404     11.44     .80
   3.0    2.605      .047                          11.75